Raincouver Precipitation Prediction¶
Summary¶
Our project is to build a classification model to predict if there's precipitation in a day (True or False) and a regression model to predict the amount of precipitation, based on features of temperature, wind speed, direction, shortwave radiation and evapotranspiration. The best classification model in our training and testing process is SVC-RBF with hyperparameter C=10.0. It yields the best test score of 0.8625 and f1-score of 0.87 on the positive class (there's precipitation) when generalizes to the unseen data. This is a pretty high accuracy to predict whether there's rain on a particular day. The best regression model trained with the same features to predict the amount of precipitaiton is SVR with gamma=0.1 and C=1000. It produces the best score on the unseen test data of 0.6993. The accuracy is adequate. More study could be done to improve the regression model.
Introduction¶
Prediction of daily precipitation is a fundamental aspect of meteorological studies [1]. Accurate precipitation prediction is crucial for agriculture, water resources management, as well as daily activities of people. Specifically, in a geographically and climatically diverse region like Vancouver, predicting precipitation is vital for people to prepare for extreme weather events, reducing hazards and minimizing property damage.
In this project, we aim to predict the occurrence and the amount of daily precipitation in Vancouver using machine learning (ML) classification methods [2]. Specifically, our analysis utilizes a dataset containing daily precipitation information in Vancouver from 1990 to the present (i.e., 6 Nov, 2023). This dataset, sourced from Open-Meteo’s Historical Weather API [3], includes a number of parameters relevant to precipitation prediction. Key parameters include month, daily temperature measures, wind speeds, wind direction, shortwave radiation, and ET₀ reference evapotranspiration. Specifically, shortwave radiation represents the sum of solar energy received in a day; ET₀ reference evapotranspiration provides an indication of the atmospheric demand for moisture (i.e., higher relative humidity reduces ET₀ ); and month is also included as a variable since it accounts for the seasonal variations in precipitation [4]. This project may contributes insights into accurate forecast of the precipitation in Vancouver.
Exploratory data analysis¶
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import altair as alt
from datetime import datetime, timedelta
import numpy as np
import pandas as pd
# For model training (classification)
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from datetime import datetime as dt
# For regression
from sklearn.dummy import DummyRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV
import os
home_directory = os.path.expanduser("~")
os.chdir((home_directory+'/work'))
# Self-defined functions
from src.utils import encode, cross_val_model, mean_std_cross_val_scores
Data Extraction using Open-Meteo’s Historical Weather API¶
from src.get_api import get_vancouver_data
start_date = "1990-01-01"
end_date = (datetime.now() - timedelta(days = 7)).strftime('%Y-%m-%d')
precipit_df = get_vancouver_data(start_date, end_date).drop(columns = ['sunrise',
'sunset',
'weather_code',
'rain_sum',
'snowfall_sum',
'precipitation_hours']).reset_index()
print(precipit_df.shape)
print(f"Extracted data from {start_date} to {end_date}")
Coordinates 49.244285583496094°E -123.13357543945312°N Elevation 73.0 m asl Timezone b'America/Vancouver' b'PST' Timezone difference to GMT+0 -28800 s (12374, 13) Extracted data from 1990-01-01 to 2023-11-17
###Use below code block if desired to run based on manual CSV file###
# precipit_df_manual = pd.read_csv("data/van_weather_1990-01-01_2023-11-06.csv"
# ).drop(columns = ['sunrise',
# 'sunset',
# 'weather_code',
# 'rain_sum',
# 'snowfall_sum',
# 'precipitation_hours'])
# precipit_df_manual.shape
Column description¶
date: date of the recordtemperature_2m_max: Maximum daily air temperature at 2 meters above ground (°C)temperature_2m_min: Minimum daily air temperature at 2 meters above ground (°C)temperature_2m_mean: Mean daily air temperature at 2 meters above ground (°C)apparent_temperature_max: Maximum daily apparent temperature (°C)apparent_temperature_min: Minimum daily apparent temperature (°C)apparent_temperature_mean: Mean daily apparent temperature (°C)precipitation_sum: Sum of daily precipitation (including rain, showers and snowfall) (mm)wind_speed_10m_max: Maximum wind speed on a day (km/h)wind_gusts_10m_max: Maximum wind gusts on a day (km/h)wind_direction_10m_dominant: Dominant wind direction (°)shortwave_radiation_sum: The sum of solar radiaion on a given day in Megajoules (MJ/m²)et0_fao_evapotranspiration: Daily sum of ET₀ Reference Evapotranspiration of a well watered grass field (mm)
Pulling in the data and creating a classification target column is_precipitation based on sum of daily precipitation precipitation_sum. If precipitation_sum is greater than 0.01, we assign True to is_precipitation, otherwise False. The reason we use 0.01 as the threshold for assigning the class is because 0.01 is insignificant and can be used to avoid rounding issue. precipittion_sum is the regression target column.
precipit_df['is_precipitation'] = precipit_df['precipitation_sum'] > 0.01
precipit_df['date'] = pd.to_datetime(precipit_df['date'])
precipit_df['month'] = precipit_df['date'].dt.month
# precipit_df['month'] = precipit_df['date'].str[5:7].astype(int)
precipit_df = precipit_df.drop(columns='date')
precipit_df
| temperature_2m_max | temperature_2m_min | temperature_2m_mean | apparent_temperature_max | apparent_temperature_min | apparent_temperature_mean | precipitation_sum | wind_speed_10m_max | wind_gusts_10m_max | wind_direction_10m_dominant | shortwave_radiation_sum | et0_fao_evapotranspiration | is_precipitation | month | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5.7095 | 2.0095 | 3.903250 | 1.759364 | -2.227180 | 0.155060 | 1.600000 | 18.250259 | 35.279999 | 222.906708 | 4.09 | 0.504341 | True | 1 |
| 1 | 3.0595 | 0.0095 | 1.959500 | -0.347375 | -2.789511 | -1.547469 | 1.000000 | 12.261158 | 28.799999 | 171.749664 | 3.29 | 0.467342 | True | 1 |
| 2 | 4.4095 | 1.9595 | 2.934500 | 1.055596 | -1.572106 | -0.585330 | 11.400002 | 17.555307 | 34.560001 | 141.703812 | 1.89 | 0.201307 | True | 1 |
| 3 | 7.3595 | 3.2095 | 4.917833 | 3.778481 | 0.573077 | 1.792292 | 12.599999 | 18.806337 | 38.160000 | 162.864502 | 2.43 | 0.277709 | True | 1 |
| 4 | 8.3595 | 3.6595 | 5.892833 | 4.432674 | 0.000019 | 2.062605 | 17.699999 | 32.919827 | 64.439995 | 159.804855 | 0.64 | 0.168201 | True | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 12369 | 7.9305 | 1.9805 | 6.103416 | 5.815475 | -0.451266 | 3.571621 | 0.900000 | 15.175612 | 31.319998 | 287.469147 | 6.18 | 0.620820 | True | 11 |
| 12370 | 7.5305 | 1.2805 | 4.465917 | 5.156983 | -1.448553 | 2.176431 | 0.000000 | 8.707238 | 16.559999 | 306.469299 | 5.14 | 0.572728 | False | 11 |
| 12371 | 7.0805 | 1.6805 | 4.261751 | 3.856080 | -1.026966 | 1.739408 | 0.400000 | 15.827721 | 27.359999 | 322.753113 | 3.42 | 0.457760 | True | 11 |
| 12372 | 6.6805 | 0.7305 | 3.082584 | 4.175250 | -1.849204 | 0.395497 | 0.000000 | 8.287822 | 12.959999 | 330.548584 | 6.00 | 0.614756 | False | 11 |
| 12373 | 8.0305 | 1.4805 | 5.390917 | 5.480538 | -1.859046 | 2.685801 | 0.000000 | 8.669949 | 15.119999 | 323.408813 | 6.04 | 0.684940 | False | 11 |
12374 rows × 14 columns
Using .info() to examinate the data, there no missing value in each column. All features are numeric type.
precipit_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 12374 entries, 0 to 12373 Data columns (total 14 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 temperature_2m_max 12374 non-null float32 1 temperature_2m_min 12374 non-null float32 2 temperature_2m_mean 12374 non-null float32 3 apparent_temperature_max 12374 non-null float32 4 apparent_temperature_min 12374 non-null float32 5 apparent_temperature_mean 12374 non-null float32 6 precipitation_sum 12374 non-null float32 7 wind_speed_10m_max 12374 non-null float32 8 wind_gusts_10m_max 12374 non-null float32 9 wind_direction_10m_dominant 12374 non-null float32 10 shortwave_radiation_sum 12374 non-null float32 11 et0_fao_evapotranspiration 12374 non-null float32 12 is_precipitation 12374 non-null bool 13 month 12374 non-null int32 dtypes: bool(1), float32(12), int32(1) memory usage: 640.6 KB
Plotting the distribution of each numeric columns in below. The temperature features are more like a normal distribution. Wind speed, radiation and evapotranspiration are slightly right-skewed. The wind direction feature seems to be bimodal.
alt.data_transformers.enable("vegafusion")
numeric_cols = precipit_df.select_dtypes(include=['number']).columns.tolist()
numeric_cols_hists = alt.Chart(precipit_df).mark_bar().encode(
alt.X(alt.repeat()).type('quantitative').bin(maxbins=40),
y='count()',
).properties(
height=100,
width=200
).repeat(
numeric_cols,
columns=4
)
# Show the plot
numeric_cols_hists
In below correlation matrix, we notice that temperatue features are highly correlated with each other. We can just use one temperature parameter in our analysis. Here we choose to use temperature_2m_mean. Similarly, winds features are also highly correlated. Hence we decide to keep wind_speed_10m_max and drop the other one to avoid collinearity issue.
precipit_df[numeric_cols].corr('spearman').style.background_gradient(cmap="gist_yarg")
| temperature_2m_max | temperature_2m_min | temperature_2m_mean | apparent_temperature_max | apparent_temperature_min | apparent_temperature_mean | precipitation_sum | wind_speed_10m_max | wind_gusts_10m_max | wind_direction_10m_dominant | shortwave_radiation_sum | et0_fao_evapotranspiration | month | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| temperature_2m_max | 1.000000 | 0.919759 | 0.982866 | 0.993845 | 0.929232 | 0.980871 | -0.414234 | -0.262876 | -0.142114 | 0.403887 | 0.787070 | 0.857323 | 0.162502 |
| temperature_2m_min | 0.919759 | 1.000000 | 0.969675 | 0.917191 | 0.992831 | 0.962776 | -0.191277 | -0.117147 | 0.000447 | 0.343538 | 0.597534 | 0.693737 | 0.252103 |
| temperature_2m_mean | 0.982866 | 0.969675 | 1.000000 | 0.978973 | 0.973799 | 0.995476 | -0.321347 | -0.200910 | -0.076306 | 0.399426 | 0.722382 | 0.804336 | 0.209584 |
| apparent_temperature_max | 0.993845 | 0.917191 | 0.978973 | 1.000000 | 0.932340 | 0.985138 | -0.422543 | -0.312866 | -0.197508 | 0.395447 | 0.779018 | 0.843102 | 0.169212 |
| apparent_temperature_min | 0.929232 | 0.992831 | 0.973799 | 0.932340 | 1.000000 | 0.975322 | -0.233137 | -0.189025 | -0.067247 | 0.363866 | 0.620153 | 0.707343 | 0.251919 |
| apparent_temperature_mean | 0.980871 | 0.962776 | 0.995476 | 0.985138 | 0.975322 | 1.000000 | -0.344119 | -0.264200 | -0.141959 | 0.399745 | 0.722736 | 0.797548 | 0.212443 |
| precipitation_sum | -0.414234 | -0.191277 | -0.321347 | -0.422543 | -0.233137 | -0.344119 | 1.000000 | 0.564394 | 0.543813 | -0.265130 | -0.624788 | -0.588134 | -0.025687 |
| wind_speed_10m_max | -0.262876 | -0.117147 | -0.200910 | -0.312866 | -0.189025 | -0.264200 | 0.564394 | 1.000000 | 0.916572 | -0.145794 | -0.333296 | -0.282469 | -0.044116 |
| wind_gusts_10m_max | -0.142114 | 0.000447 | -0.076306 | -0.197508 | -0.067247 | -0.141959 | 0.543813 | 0.916572 | 1.000000 | -0.057211 | -0.191130 | -0.139235 | -0.056115 |
| wind_direction_10m_dominant | 0.403887 | 0.343538 | 0.399426 | 0.395447 | 0.363866 | 0.399745 | -0.265130 | -0.145794 | -0.057211 | 1.000000 | 0.462707 | 0.435294 | 0.042319 |
| shortwave_radiation_sum | 0.787070 | 0.597534 | 0.722382 | 0.779018 | 0.620153 | 0.722736 | -0.624788 | -0.333296 | -0.191130 | 0.462707 | 1.000000 | 0.976689 | -0.098498 |
| et0_fao_evapotranspiration | 0.857323 | 0.693737 | 0.804336 | 0.843102 | 0.707343 | 0.797548 | -0.588134 | -0.282469 | -0.139235 | 0.435294 | 0.976689 | 1.000000 | -0.048682 |
| month | 0.162502 | 0.252103 | 0.209584 | 0.169212 | 0.251919 | 0.212443 | -0.025687 | -0.044116 | -0.056115 | 0.042319 | -0.098498 | -0.048682 | 1.000000 |
We are dropping the temperature features that are highly correlated with temperature_2m_mean and the wind feature that are highly correlated to wind_speed_10m_max. After cleaning up the data, we are going to predict precipitation_sum and is_precipitation using the feature temperature_2m_mean, wind_speed_10m_max, wind_direction_10m_dominant, shortwave_radiation_sum, et0_fao_evapotranspiration and month.
precipit_df = precipit_df.drop(columns = ['temperature_2m_max',
'temperature_2m_min',
'apparent_temperature_max',
'apparent_temperature_min',
'apparent_temperature_mean',
'wind_gusts_10m_max'])
precipit_df
| temperature_2m_mean | precipitation_sum | wind_speed_10m_max | wind_direction_10m_dominant | shortwave_radiation_sum | et0_fao_evapotranspiration | is_precipitation | month | |
|---|---|---|---|---|---|---|---|---|
| 0 | 3.903250 | 1.600000 | 18.250259 | 222.906708 | 4.09 | 0.504341 | True | 1 |
| 1 | 1.959500 | 1.000000 | 12.261158 | 171.749664 | 3.29 | 0.467342 | True | 1 |
| 2 | 2.934500 | 11.400002 | 17.555307 | 141.703812 | 1.89 | 0.201307 | True | 1 |
| 3 | 4.917833 | 12.599999 | 18.806337 | 162.864502 | 2.43 | 0.277709 | True | 1 |
| 4 | 5.892833 | 17.699999 | 32.919827 | 159.804855 | 0.64 | 0.168201 | True | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 12369 | 6.103416 | 0.900000 | 15.175612 | 287.469147 | 6.18 | 0.620820 | True | 11 |
| 12370 | 4.465917 | 0.000000 | 8.707238 | 306.469299 | 5.14 | 0.572728 | False | 11 |
| 12371 | 4.261751 | 0.400000 | 15.827721 | 322.753113 | 3.42 | 0.457760 | True | 11 |
| 12372 | 3.082584 | 0.000000 | 8.287822 | 330.548584 | 6.00 | 0.614756 | False | 11 |
| 12373 | 5.390917 | 0.000000 | 8.669949 | 323.408813 | 6.04 | 0.684940 | False | 11 |
12374 rows × 8 columns
Plotting the scatter charts for each feature vs the regression target precipitation_sum to investigate if there is any pattern present. In preliminary investigation, we notice there is no pattern standing out for temperature, wind speed and wind direction. However, shortwave radiation and evapotranspiration shows strong negative correlation with the precipitation amount precipitation_sum. We also notice for month January, February, October, November and December. This is expected because there are more rain in winter in Vancouver.
interesting_cols = precipit_df.select_dtypes(include=['number']).columns.tolist()
interesting_cols.remove('precipitation_sum')
numeric_cols_vs_target_r = alt.Chart(precipit_df).mark_point().encode(
x='precipitation_sum',
y=alt.Y(alt.repeat()).type('quantitative'),
color='is_precipitation'
).properties(
height=200,
width=300
).repeat(
interesting_cols,
columns=3
)
# Show the plot
numeric_cols_vs_target_r
Plotting a box plot for each feature vs the classification target is_precipitation. We notice shortwave_radiation_sum and et0_fao_evapotranspiration has different means for the False and True class. Radiation and evapotranspiration mean tends to be higher when there's no precipitation than when there's precipitation.
numeric_cols_vs_target_c = alt.Chart(precipit_df).mark_boxplot().encode(
x='is_precipitation',
y=alt.Y(alt.repeat()).type('quantitative')
).properties(
height=200,
width=300
).repeat(
interesting_cols,
columns=3
)
numeric_cols_vs_target_c
We are splitting up the cleaned data into 80% training set and 20% test set.
train_df, test_df = train_test_split(precipit_df, test_size=0.2, random_state=522)
train_df.shape
(9899, 8)
X_train = train_df.drop(columns=['precipitation_sum', 'is_precipitation'])
y_train_class, y_train_regress = train_df['is_precipitation'], train_df['precipitation_sum']
X_test = test_df.drop(columns=['precipitation_sum', 'is_precipitation'])
y_test_class, y_test_regress = test_df['is_precipitation'], test_df['precipitation_sum']
The proportion of samples labeled as True is similar to the proportion of the samples labeled as False.
y_train_class.value_counts(normalize=True)
is_precipitation True 0.555713 False 0.444287 Name: proportion, dtype: float64
Classification Model¶
from datetime import datetime as dt
X_train = encode(X_train, 'month', 12)
X_test = encode(X_test, 'month', 12 )
X_train.columns
Index(['temperature_2m_mean', 'wind_speed_10m_max',
'wind_direction_10m_dominant', 'shortwave_radiation_sum',
'et0_fao_evapotranspiration', 'month', 'month_sin', 'month_cos'],
dtype='object')
from sklearn.compose import make_column_transformer
numeric_features = ['temperature_2m_mean', 'wind_speed_10m_max',
'wind_direction_10m_dominant', 'shortwave_radiation_sum',
'et0_fao_evapotranspiration', 'month_sin', 'month_cos']
drop_features = ['month']
numeric_transformer = StandardScaler()
preprocess = make_column_transformer(
(numeric_transformer, numeric_features), # scaling on numeric features
("drop", drop_features)
)
# We will evaluate a set of models for this project
models = {
"dummy": DummyClassifier(random_state=522),
"Decision Tree": DecisionTreeClassifier(random_state=522),
"KNN": KNeighborsClassifier(),
"RBF SVM": SVC(random_state=522),
"Logistic Regression": LogisticRegression(max_iter=2000, multi_class="ovr", random_state=522),
}
classification_metrics = ["accuracy", "precision", "recall", "f1"]
results_df = cross_val_model(preprocess, models, X_train, y_train_class, classification_metrics)
results_df
Done with dummy Done with Decision Tree Done with KNN Done with RBF SVM Done with Logistic Regression
| dummy | Decision Tree | KNN | RBF SVM | Logistic Regression | |
|---|---|---|---|---|---|
| mean | mean | mean | mean | mean | |
| fit_time | 0.003 | 0.038 | 0.005 | 0.366 | 0.023 |
| score_time | 0.005 | 0.005 | 0.038 | 0.129 | 0.011 |
| test_accuracy | 0.556 | 0.798 | 0.837 | 0.860 | 0.845 |
| train_accuracy | 0.556 | 1.000 | 0.886 | 0.864 | 0.846 |
| test_precision | 0.556 | 0.821 | 0.862 | 0.887 | 0.862 |
| train_precision | 0.556 | 1.000 | 0.905 | 0.890 | 0.863 |
| test_recall | 1.000 | 0.815 | 0.842 | 0.858 | 0.859 |
| train_recall | 1.000 | 1.000 | 0.888 | 0.863 | 0.858 |
| test_f1 | 0.714 | 0.818 | 0.852 | 0.872 | 0.861 |
| train_f1 | 0.714 | 1.000 | 0.896 | 0.876 | 0.861 |
Based on accuracy, test_recall, and test_precision, RBF SVM is the best performing model. We will take a look at it for future hyperparameter optimization.
Meanwhile, let us take a sidetrack and find out which features are important by looking at the feature importance via logistic regression. It is difficult to interpret feature importance of SVC fitted via RBF.
pipe = make_pipeline(preprocess, models["Logistic Regression"])
pipe.fit(X_train, y_train_class)
coefficients = pipe.named_steps['logisticregression'].coef_[0]
feature_importance = pd.DataFrame({'Feature': numeric_features, 'Importance': np.abs(coefficients)})
feature_importance = feature_importance.sort_values('Importance', ascending=True)
feature_importance.plot(x='Feature', y='Importance', kind='barh', figsize=(10, 6))
<Axes: ylabel='Feature'>
As you can see, the month and evapotranspiration are very important features.
For the chosen model of SVM RBF, let us optimize the training parameter.
param_grid = {"svc__C": 10.0**np.arange(-3,3)}
svc_pipe = make_pipeline(preprocess, models['RBF SVM'])
grid_search = GridSearchCV(svc_pipe,param_grid=param_grid,n_jobs=-1,return_train_score=True)
grid_search.fit(X_train, y_train_class)
print(grid_search.best_params_)
C_best_value = grid_search.best_params_['svc__C']
opt_pipe = make_pipeline(preprocess, SVC(C=C_best_value, random_state = 522))
opt_pipe.fit(X_train, y_train_class)
opt_pipe.score(X_test, y_test_class)
{'svc__C': 100.0}
0.8517171717171718
Best Model is the with C=10.0, since it gives highest test_score
from sklearn.metrics import classification_report
print(
classification_report(
y_test_class, opt_pipe.predict(X_test), labels=[True,False]
)
)
precision recall f1-score support
True 0.86 0.87 0.86 1335
False 0.84 0.84 0.84 1140
accuracy 0.85 2475
macro avg 0.85 0.85 0.85 2475
weighted avg 0.85 0.85 0.85 2475
In conclusion, we will go ahead and choose support vector machine (RBF) for our classification whether there is rain on a particular day. The accuracy is 0.86 (almost 90%!!)
Regression Model¶
# All the features are numerical features to be standardized. Here, we will use standard scalar.
preprocess_reg = make_column_transformer(
(numeric_transformer, numeric_features), # scaling on numeric features
("drop", drop_features)
)
# We will evaluate a set of models for this project
regressor_models = {
"Dummy Regressor": DummyRegressor(),
"Decision Tree Regressor": DecisionTreeRegressor(random_state=522),
"KNN Regressor": KNeighborsRegressor(),
"SVR": SVR(),
"Ridge": Ridge(random_state=522)
}
cv_results_reg = {}
for model_name, model in regressor_models.items():
pipe = make_pipeline(preprocess_reg, model)
cv_results_reg[model_name] = mean_std_cross_val_scores(pipe, X_train, y_train_regress, return_train_score=True)
print("Done with", model_name)
results_df_reg = pd.DataFrame(cv_results_reg).T
results_df_reg
Done with Dummy Regressor Done with Decision Tree Regressor Done with KNN Regressor Done with SVR Done with Ridge
| fit_time | score_time | test_score | train_score | |
|---|---|---|---|---|
| Dummy Regressor | 0.005 (+/- 0.005) | 0.002 (+/- 0.001) | -0.002 (+/- 0.002) | 0.000 (+/- 0.000) |
| Decision Tree Regressor | 0.042 (+/- 0.002) | 0.002 (+/- 0.000) | 0.367 (+/- 0.047) | 1.000 (+/- 0.000) |
| KNN Regressor | 0.006 (+/- 0.002) | 0.010 (+/- 0.001) | 0.592 (+/- 0.025) | 0.729 (+/- 0.006) |
| SVR | 0.872 (+/- 0.029) | 0.254 (+/- 0.006) | 0.608 (+/- 0.022) | 0.610 (+/- 0.006) |
| Ridge | 0.024 (+/- 0.044) | 0.002 (+/- 0.001) | 0.488 (+/- 0.011) | 0.489 (+/- 0.003) |
Based on the cross-validation results above
- Dummy Regressor: Not useful for predictions.
- Decision Tree Regressor: Overfits the data (good on training, poor on testing).
- KNN Regressor: Decent performance, but a bit of overfitting.
- SVR: Best performance, good at generalizing (similar scores on training and testing).
- Ridge: Moderate performance, good at generalizing but not as accurate as SVR or KNN.
Best Choice: SVR, because it has the highest accuracy and generalizes well, but it's a little bit slower.
pipe_reg = make_pipeline(preprocess_reg, SVR(kernel='linear'))
pipe_reg.fit(X_train, y_train_regress)
coefficients = pipe_reg.named_steps['svr'].coef_[0]
feature_importance = pd.DataFrame({'Feature': numeric_features, 'Importance': np.abs(coefficients)})
feature_importance = feature_importance.sort_values('Importance', ascending=True)
feature_importance.plot(x='Feature', y='Importance', kind='barh', figsize=(10, 6))
<Axes: ylabel='Feature'>
For regression model on predicting the precipitation sum, the short wave radiation is the most important.
param_grid = {
'svr__C': np.logspace(-4, 3, 8),
'svr__gamma': np.logspace(-3, 2, 6)
}
svr_pipe = make_pipeline(preprocess_reg, SVR())
svr_random_search = RandomizedSearchCV(svr_pipe, param_distributions=param_grid, n_iter=20, n_jobs=5, return_train_score=True)
svr_random_search.fit(X_train, y_train_regress)
RandomizedSearchCV(estimator=Pipeline(steps=[('columntransformer',
ColumnTransformer(transformers=[('standardscaler',
StandardScaler(),
['temperature_2m_mean',
'wind_speed_10m_max',
'wind_direction_10m_dominant',
'shortwave_radiation_sum',
'et0_fao_evapotranspiration',
'month_sin',
'month_cos']),
('drop',
'drop',
['month'])])),
('svr', SVR())]),
n_iter=20, n_jobs=5,
param_distributions={'svr__C': array([1.e-04, 1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02, 1.e+03]),
'svr__gamma': array([1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02])},
return_train_score=True)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
RandomizedSearchCV(estimator=Pipeline(steps=[('columntransformer',
ColumnTransformer(transformers=[('standardscaler',
StandardScaler(),
['temperature_2m_mean',
'wind_speed_10m_max',
'wind_direction_10m_dominant',
'shortwave_radiation_sum',
'et0_fao_evapotranspiration',
'month_sin',
'month_cos']),
('drop',
'drop',
['month'])])),
('svr', SVR())]),
n_iter=20, n_jobs=5,
param_distributions={'svr__C': array([1.e-04, 1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02, 1.e+03]),
'svr__gamma': array([1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02])},
return_train_score=True)Pipeline(steps=[('columntransformer',
ColumnTransformer(transformers=[('standardscaler',
StandardScaler(),
['temperature_2m_mean',
'wind_speed_10m_max',
'wind_direction_10m_dominant',
'shortwave_radiation_sum',
'et0_fao_evapotranspiration',
'month_sin', 'month_cos']),
('drop', 'drop', ['month'])])),
('svr', SVR())])ColumnTransformer(transformers=[('standardscaler', StandardScaler(),
['temperature_2m_mean', 'wind_speed_10m_max',
'wind_direction_10m_dominant',
'shortwave_radiation_sum',
'et0_fao_evapotranspiration', 'month_sin',
'month_cos']),
('drop', 'drop', ['month'])])['temperature_2m_mean', 'wind_speed_10m_max', 'wind_direction_10m_dominant', 'shortwave_radiation_sum', 'et0_fao_evapotranspiration', 'month_sin', 'month_cos']
StandardScaler()
['month']
drop
SVR()
pd.DataFrame(svr_random_search.cv_results_)[
[
"mean_test_score",
"param_svr__C",
"param_svr__gamma",
"mean_fit_time",
"rank_test_score",
]
].set_index("rank_test_score").sort_index().T
| rank_test_score | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| mean_test_score | 0.674576 | 0.656729 | 0.637264 | 0.575335 | 0.567791 | 0.497295 | 0.427917 | 0.262469 | -0.162109 | -0.201055 | -0.211936 | -0.228916 | -0.236711 | -0.272128 | -0.280475 | -0.283245 | -0.284768 | -0.284864 | -0.284907 | -0.285484 |
| param_svr__C | 100.0 | 10.0 | 100.0 | 1.0 | 1000.0 | 0.1 | 1000.0 | 1.0 | 0.001 | 1.0 | 0.1 | 0.01 | 0.001 | 0.0001 | 0.0001 | 0.0001 | 0.01 | 0.001 | 0.0001 | 0.0001 |
| param_svr__gamma | 0.1 | 1.0 | 1.0 | 1.0 | 0.001 | 0.1 | 10.0 | 0.001 | 0.1 | 100.0 | 10.0 | 0.001 | 0.01 | 0.1 | 0.01 | 1.0 | 100.0 | 10.0 | 0.001 | 100.0 |
| mean_fit_time | 3.491547 | 2.31292 | 12.499724 | 1.123653 | 1.598251 | 1.123495 | 10.111327 | 1.312524 | 1.295528 | 2.722705 | 1.888045 | 1.314482 | 1.346031 | 1.394501 | 1.247943 | 1.33649 | 1.908747 | 1.909253 | 1.383661 | 2.102351 |
print('SVR Best Parameters:')
print(svr_random_search.best_params_)
print()
print('SVR Best Score:')
print(svr_random_search.best_score_)
SVR Best Parameters:
{'svr__gamma': 0.1, 'svr__C': 100.0}
SVR Best Score:
0.6745764235017668
Evaluate the SVR model based on the best parameters
best_pipe = make_pipeline(preprocess_reg, SVR(gamma=0.1, C=1000))
best_pipe.fit(X_train, y_train_regress)
best_pipe.score(X_test, y_test_regress)
#Done evaluation#
0.7023144745872669
Discussion¶
The best classification model in our training and testing process is SVC-RBF with hyperparameter C=10.0. It yields the best test score of 0.8625 and f1-score of 0.87 on the positive class (precipitation occurs on the day) when generalizes to the unseen data. This is a pretty high accuracy to predict whether there's rain on a particular day.
The best regression model trained with the same features to predict the amount of precipitaiton is SVR with gamma=0.1 and C=1000. The score on the unseen test data is 0.6993, which is adequate. This result suggest more study (e.g., adding new features) could be done to improve the regression model.
Reference¶
[1] New, Mark, et al. "Precipitation measurements and trends in the twentieth century." International Journal of Climatology: A Journal of the Royal Meteorological Society 21.15 (2001): 1889-1922.
[2] Ortiz-García, E. G., S. Salcedo-Sanz, and C. Casanova-Mateo. "Accurate precipitation prediction with support vector classifiers: A study including novel predictive variables and observational data." Atmospheric research 139 (2014): 128-136.
[3] Zippenfenig, P. (2023). Open-Meteo.com Weather API [Computer software]. Zenodo. https://doi.org/10.5281/ZENODO.7970649
[4] Pal, Jeremy S., Eric E. Small, and Elfatih AB Eltahir. "Simulation of regional‐scale water and energy budgets: Representation of subgrid cloud and precipitation processes within RegCM." Journal of Geophysical Research: Atmospheres 105.D24 (2000): 29579-29594.